Abstract

A Dynamic Instruction Set Computer (DISC) has been developed
that supports demand-driven modification of its instruction
set. Implemented with partially reconfigurable FPGAs, DISC
treats instructions as removable modules paged in and out
through partial reconfiguration as demanded by the executing
program. Instructions occupy FPGA resources only when needed
and FPGA resources can be reused to implement an arbitrary
number of performance-enhancing application-specific
instructions. DISC further enhances the functional density of
FPGAs by physically relocating instruction modules to
available FPGA space.

1 Introduction 

Developing customized stored-program processors is a
convenient design technique that combines the enhanced
performance of application-specific circuits with the
flexibility of general-purpose programmable processors.
Application-specific instruction sets, customized I/O and
optimized control can substantially improve the performance
of even the simplest programmable processors. FPGAs provide
an excellent implementation platform for application-specific
processors because of the quick development time and
simplified design process. In addition, SRAM-based FPGAs
provide the ability to reconfigure more than one distinct
application-specific processor on a single device.

A number of general purpose processors have 
been developed to show the feasibility of implement- 
ing a processor architecture on an FPGA[5, 7, 17]. 
Several custom processors have successfully demon- 
strated the advantages of adding specialized hard- 
ware to general purpose processor cores. Applica- 
tion areas for these processors include digital audio 
processing[16], systems of linear equations[17], and 
statistical physics [12]. 

One limitation of building customized processors on FPGAs is
the lack of hardware resources available for specialized
instruction sets. A few hardware-intensive instruction
modules can quickly consume all the resources of even the
largest FPGAs available today. Reconfiguring an FPGA to
replace idle circuitry during application execution can
provide more hardware resources than are available on a
one-time configured FPGA. This technique, known as run-time
reconfiguration (RTR), has been shown to increase the
functional density of reconfigurable FPGAs[6]. The DISC
processor uses RTR to ameliorate FPGA hardware limitations
and provide an essentially limitless application-specific
instruction set.

Early attempts at modifying a processor instruction set
involved a writable control store and generating custom
micro-code for each application[14]. The PRISM project
extended this idea by augmenting the instruction set of a
standard RISC processor with application-specific
instructions on a tightly coupled FPGA. Hardware images of
these instructions are extracted and compiled from the source
code transparently to the user[2]. The WASMII project
discusses a more dynamic approach that involves swapping
hardware compute configurations in and out of the FPGA
resource as demanded by the data-flow token [9].

The DISC processor implements each instruction in 
the instruction set as an independent circuit module. 
The individual instruction modules are paged onto the 
hardware in a demand-driven manner as dictated by 
the application program. Hardware limitations are 
eliminated by replacing unused instruction modules 
with usable instructions at run-time. An application 
running on DISC contains source code, indicating in- 
struction ordering, and a library of application-specific 
instruction circuit modules. 

This paper will begin by describing the techniques 
used to implement DISC. These include partial recon- 
figuration, relocatable hardware, and the linear hard- 
ware model. The architecture of the DISC processor 
will be presented along with several example custom 
instructions. The DISC processing system, including 
software and hardware platform, will be described. 
The paper will conclude by presenting results from 
an algorithm implemented on DISC. 

2 Partial FPGA Reconfiguration 

DISC takes advantage of partial FPGA configuration to
implement dynamic instruction paging. Partial reconfiguration
provides the ability to configure a subsection of an FPGA
while the remaining logic operates unaffected. Although all
SRAM-based FPGAs can be reconfigured in-circuit, only the
CAL[1], Atmel[3], and National Semiconductor [13] FPGAs
support the ability to partially reconfigure hardware
resources.

Although few partially reconfigurable systems have actually
been implemented, several have been proposed, such as
hardware multi-tasking [10], a multi-phase serial
communication algorithm[11], a data acquisition system [4],
and a self-reconfiguring processor[8]. In addition, caching
logic to increase hardware efficiency in standard digital
systems has been proposed using partially reconfigurable
FPGAs[15].

DISC uses partial configuration to implement 
custom-instruction caching. Instruction modules are 
implemented as partial configurations and individu- 
ally configured on DISC as demanded by the applica- 
tion program. Before initiating execution of a custom- 
instruction, DISC queries the FPGA for the pres- 
ence of the custom-instruction configuration. If the 
custom-instruction is on the FPGA, execution is initi- 
ated. Otherwise, program execution pauses while the 
custom-instruction is configured on the FPGA. 

As a typical program executes, custom-instructions 
are configured onto the FPGA until all available hard- 
ware is consumed. When all hardware is used by the 
custom-instructions, new custom-instruction modules 
may not be configured on the FPGA until enough ex- 
isting hardware is removed. By replacing the oldest 
custom-instruction modules on the FPGA with newer 
modules, the FPGA serves as a cache of the most- 
recently used custom-instruction modules. 

2.1 Example 

The following assembly language source code exem- 
plifies the use of partial configuration on DISC: 

begin:
        ; instruction INSTA operates on
        ; memory location mem1
        INSTA mem1
        INSTA mem2

        ; instruction INSTB operates on
        ; mem3 and mem2
        INSTB mem3, mem2

        ; "loopback" label defined
loopback:
        INSTC mem3

        ; instruction CMP compares
        ; mem1 with mem3
        CMP mem1, mem3

        ; instruction JNE jumps
        ; to loopback if not equal
        JNE loopback
continue:
        INSTD mem3
        INSTB mem2
        INSTE mem3
end:

Once each instruction in the previous program (INSTA, INSTB,
INSTC, CMP, JNE, INSTD, and INSTE) has been designed as an
independent partial configuration, the source code
representing the program is loaded into DISC and the
processor begins execution. On a small FPGA, the instructions
may be configured and executed in the following sequence:

Operation   Instruction

Configure   INSTA on FPGA
Execute     first INSTA
Execute     second INSTA
Configure   INSTB on FPGA
Execute     first INSTB
Configure   INSTC on FPGA
Execute     first INSTC
Execute     CMP (always available)
Execute     JNE (always available)
            (repeat INSTC until JNE fails)
            FPGA full, remove oldest module
Configure   INSTD
Execute     INSTD
Execute     second INSTB
            FPGA full, remove oldest module
Configure   INSTE
Execute     INSTE



In the previous example, it is assumed that the first 
five instructions (INSTA, INSTB, INSTC, CMP, and 
JNE) consume all available space on a single FPGA. 
Partially configuring the FPGA allows two additional 
instructions (INSTD and INSTE) to execute on an oth- 
erwise full FPGA. 
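
The sequencing above can be mimicked with a short simulation. The following Python sketch is illustrative only and is not part of the DISC tool set: it assumes every custom module has the same size, that the FPGA holds three of them at once (the `capacity` parameter), that CMP and JNE are always available, and that the module removed is the least recently used one. The function name `run` and the trace strings are hypothetical.

```python
from collections import OrderedDict

# Assumption from the example: CMP and JNE never need configuring.
ALWAYS_AVAILABLE = {"CMP", "JNE"}


def run(program, capacity):
    """Return the configure/execute trace for a list of opcodes.

    capacity is the number of custom modules that fit on the FPGA at
    once (a simplification: all modules are treated as equal in size).
    """
    resident = OrderedDict()  # opcode -> None, least recently used first
    trace = []
    for op in program:
        if op not in ALWAYS_AVAILABLE:
            if op in resident:
                resident.move_to_end(op)  # mark as most recently used
            else:
                if len(resident) >= capacity:
                    victim, _ = resident.popitem(last=False)
                    trace.append(f"remove {victim}")
                trace.append(f"configure {op}")
                resident[op] = None
        trace.append(f"execute {op}")
    return trace


# One pass through the loop body, as if JNE fell through the first time.
program = ["INSTA", "INSTA", "INSTB", "INSTC", "CMP", "JNE",
           "INSTD", "INSTB", "INSTE"]
trace = run(program, capacity=3)
print("\n".join(trace))
```

Under these assumptions INSTA is removed to make room for INSTD, and INSTC (rather than the recently reused INSTB) is removed to make room for INSTE. A strict oldest-first policy would remove INSTB instead.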

2.2 Advantages 

Partial configuration provides a number of advantages for
DISC over conventional configuration methods. First, idle
instruction modules can be removed to make room for other
usable modules. The ability to replace instruction modules in
the system at run-time allows the implementation of an
instruction set much larger than is possible on a single
one-time configured FPGA.

Second, configuration time is substantially reduced. Although
the DISC FPGA could be completely configured every time a new
instruction is needed, configuration overhead can be
dramatically reduced by configuring only the requested
instruction. Reducing the size of hardware to configure
significantly reduces the
configuration bit-stream. Configuration bit-stream re- 
ductions for DISC instruction modules fall between 4k 
and £ of a complete FPGA configuration. With a sig- 
nificantly smaller bit-stream, the corresponding con- 
figuration time is reduced. In an environment of run- 
time configuration, reducing the configuration time 
will limit the reconfiguration overhead. 

Third, system state can be saved on the FPGA dur- 
ing configuration. Conventional configuration tech- 
niques prevent the preservation of system state during 
configuration by destroying the contents of all flip- 
flops. Implementing DISC with conventional configu- 
ration methods would require the saving and restor- 
ing of system state (program counter, register values, 
etc.) every time a configuration occurs. To prevent the
time-consuming process of saving and restoring state, DISC
implements a global controller that remains on the FPGA at
all times.

In summary, partial configuration allows DISC 
to implement an essentially infinite instruction set 
in hardware with limited configuration and state- 
preserving overhead. 

3 Relocatable Hardware 

The ability to partially configure custom-instruction modules
allows DISC to implement an important strategy: relocatable
hardware. Relocatable hardware, implemented only in partially
configurable FPGAs, provides the ability to relocate or make
placement decisions of partial configurations at run-time.
Although not essential for a general purpose processor, it is
used on DISC to substantially improve run-time hardware
utilization.

Sub-modules in traditional digital systems require a single
fixed location in hardware because of strict global and local
physical constraints. Because sub-modules in traditional
systems are not paged in and out of hardware, a fixed
location does not pose any problems and global optimizations
can be made on the static circuitry to improve hardware
utilization. In a run-time partially reconfigurable system,
however, fixed locations for partial configurations can pose
serious performance problems.

If DISC modules are designed for a single physi- 
cal location, instructions in the library will inevitably 
overlap each other on the hardware. Two overlap- 
ping instructions can never operate properly on the 
FPGA at the same time. If two overlapping instruc- 
tions are used frequently together in an application 
program, the configuration overhead needed to replace 
the instructions quickly becomes the system bottle- 
neck. DISC removes these problems by designing each 
custom-instruction module for multiple locations on 
the FPGA. 

The flexibility of multiple locations for DISC
custom-instructions significantly improves run-time
utilization. Instruction modules are initially configured on
the FPGA as close together as possible to avoid wasted
hardware between modules. Once the hardware space is full,
additional instruction modules are placed in locations where
older unneeded instruction modules currently lie. Relocatable
hardware allows run-time constraints and conditions to
dictate instruction module placement for optimal hardware
utilization.

Relocatable hardware is implemented by designing
custom-instruction modules around a firmly defined global
context. A global context provides physical placement
positions and a communication network necessary for these
modules to operate correctly. The global context partitions
the available hardware into an array of potential placement
locations for the relocatable instruction modules. The
communication network is provided at each placement location
to ensure adequate communication between the global
controller and the instruction modules at any location.

In order to design instruction modules that fit within the
global context, all instruction modules must be physically
independent from each other. The physical layout of any
instruction module must have no effect on the physical layout
or placement of any other module in the library.

4 Linear Hardware Space 

DISC implements relocatable hardware in the form of a linear
hardware model. As the name suggests, the model is based on a
linear, one-dimensional hardware space. The two-dimensional
grid of configurable logic cells is organized as an array of
rows: a module's location is specified by its vertical
position and its size is specified by its height (in rows).

The global context for the linear hardware model consists of
a uniform communication network and a global controller. The
communication network is constructed by running each global
signal vertically across the die and spreading the global
signals across the width of the die parallel to each other
(see Figure 1).

Figure 1: Linear Hardware Space.

The communication network provides access to 
global resources for all instruction modules and per- 
forms intermodule communication. The global con- 
troller specifies the communication protocol, controls 
global resources (such as I/O and global state) and 
monitors circuit execution. The global controller and 
the communication network remain in the same loca- 
tion throughout application execution to preserve the 
global context. 

To gain access to all global signals, sub-modules within a
linear hardware space are designed horizontally, across the
width of the FPGA. The modules lie perpendicular to the
global communication signals for full access to all global
signals regardless of their vertical placement (see Figure
2). Although all sub-modules must span the entire width of
the FPGA, each module may consume an arbitrary amount of
hardware by varying its height.

Figure 2: Simplified Custom Instruction Module.



Relocatable circuit modules communicate as estab- 
lished by the global protocol and thus operate properly 
at any vertical location. In a run-time environment, 
these circuit modules can be relocated as needed to 
optimize the available hardware space. 

5 DISC Architecture 

The DISC architecture implements relocatable hardware with
the linear hardware model on a single National Semiconductor
CLAy31 FPGA coupled to an external RAM. The CLAy31 provides a
56 x 56 array of fine-grain logic cells allowing 56 complete
rows in the linear hardware space. A complete processor is
made by coupling a global controller to a library of
custom-instruction circuit modules (see Figure 3).

Figure 3: DISC Linear Hardware Space.

5.1 Global Controller 

The global controller provides the circuitry for operating
and monitoring global resources such as the external RAM,
I/O, the internal communication network and global state. The
global controller consumes ten complete rows (approximately
1/6 of the chip) leaving 46 rows available for
custom-instruction modules. The physical layout of the global
controller, estimated at 1007 gates, along with the
communication network is seen in Figure 4.




Figure 4: DISC Global Controller Layout. 

The architecture of the global controller is seen 
in Figure 5 and is comprised of the following sub- 
modules: 

• Data Register (DR): stores intermediate results, 
provides inter-module communication buffering 
and assists in complex address generation (8 bits), 

• Address Register (AR): provides standard ad- 
dressing modes for memory access (16 bits), 

• Program Counter (PC): provides the sequencing 
capability of the processor (16 bits), 

• Status Register (SR): stores internal state of the 
processor (4 bits), 

• Instruction Register (IR): stores the opcode of 
the current instruction (8 bits), 

• Global Control Unit (GCU): contains the circuitry
necessary to preserve communication protocol, sequence
through processor states, and interface with I/O.

Figure 5: DISC Global Controller Architecture.

The global controller provides a consistent communication
interface and standard protocol for all custom-instructions
at every vertical location. The global signals available to
the custom-instructions include the following:

• Data Register Value: accesses contents of Data
Register (8 bits),

• Data Register Feedback: provides new values for
Data Register (8 bits),

• Memory Address: allows address generation con- 
trol by custom-instructions (16 bits), 

• Memory Data: allows bi-directional access of 
memory data by custom-instructions (8 bits), 

• Status Signals: provides control capability for 
custom-instructions (4 bits), 

• Instruction Register: provides opcode of current 
instruction (8 bits). 

The global controller is also responsible for sequenc- 
ing through the instruction cycles for the custom- 
instruction modules. The following instruction cycles 
are implemented by the global controller: 

• Instruction Fetch (IF), 

• Operand Fetch (OF), 

• Halt Processor (HP), 

• Custom Cycle (CC), 

• Instruction Execution (EX). 

The IF cycle stores the current program memory byte into the
instruction register and increments the program counter. The
OF cycle stores the current program byte into the address
register and also increments the program counter. The HP
cycle causes all processor resources to remain idle and is
used during configuration. The CC cycle is used by complex
custom-instruction modules for adding additional cycles and
has no effect on global resources. The EX cycle loads the
value of the data register with the contents of the data
register feedback path.

Each instruction in the library operates in one of 
two possible instruction cycle sequences: standard 
and custom. The standard instruction sequence fol- 
lows a simple three-cycle execution: IF, OF, and EX. 
Any instruction that completes its computation or 
function in a single clock cycle, such as basic arith- 
metic and logic operations, will operate with this se- 
quence. 

The custom-instruction sequence offers additional
cycles for complex custom-instructions. The custom
sequence begins with the following two cycles: IF 
followed by OF. The sequence then varies by insert- 
ing as many CC cycles as necessary to complete a 
complex application-specific operation. The custom- 
instruction sequence completes with the EX instruc- 
tion cycle. The custom-instruction module has com- 
plete control over the number of CC cycles needed for 
a particular function. Some instructions add as few as 
one cycle, while others require thousands of cycles for 
a single operation. Figure 6 displays the two instruc- 
tion sequences. 
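
The two sequences can be written down compactly. The sketch below is only an illustration of the cycle ordering described above; the function name `cycle_sequence` and its `custom_cycles` parameter are hypothetical and do not correspond to any DISC signal or tool.

```python
def cycle_sequence(custom_cycles=0):
    """Return the instruction cycles for one instruction.

    custom_cycles = 0 gives the standard sequence (IF, OF, EX).
    custom_cycles > 0 gives a custom sequence with that many CC
    cycles inserted between OF and EX.
    """
    if custom_cycles < 0:
        raise ValueError("custom_cycles must be non-negative")
    return ["IF", "OF"] + ["CC"] * custom_cycles + ["EX"]


standard = cycle_sequence()
custom = cycle_sequence(custom_cycles=2)
print(standard)
print(custom)
```

In the hardware the custom-instruction module itself decides when to stop requesting CC cycles; a fixed count is used here purely to keep the sketch small.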

The global control unit contains a number of de- 
fault instructions necessary for controlling global re- 
sources. These instructions are used for sequencing, 
status control, and memory transfer and include the 
following: 

• set carry: sets carry bit in status register, 

• clear carry: clears carry bit in status register, 

• store data register: store data register in memory,

Figure 6: DISC Instruction Sequences.



• load data register: load data register from mem- 
ory, 

• conditional jump: jump with carry not set. 

Each of these instructions follows the standard instruction
sequence of three cycles. These instructions, coupled with
the custom-instruction library designed for a particular
application, provide the complete instruction set of the
processor. An application can implement an instruction set of
any size by paging instruction modules in a demand-driven
manner from the instruction library.

5.2 Custom-instruction Modules

Custom-instruction modules vary in size and com- 
plexity, but each is designed to fit within the global 
context described above. Specifically, each module 
contains a decode and a data-path unit. Complex 
modules contain additional control structures. 

The decode unit assigns a specific op-code to the 
custom instruction and is responsible for acknowledg- 
ing its presence to the global controller. The decode 
unit compares the contents of the IR for a match 
against its own opcode during the OF cycle. On a 
positive match the module signals the global controller 
that the hardware is present and instruction sequenc- 
ing continues. 
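
The presence check can be pictured as every resident decode unit comparing the IR against its own opcode, with the global controller acting on the combined result. The following Python sketch is a behavioural illustration under that reading; the function names, the returned strings, and the opcode values are hypothetical.

```python
def acknowledge(ir, resident_opcodes):
    """True if any resident module's decode unit matches the IR."""
    return any(opcode == ir for opcode in resident_opcodes)


def after_operand_fetch(ir, resident_opcodes):
    """Decide what the global controller does once the OF cycle ends."""
    if acknowledge(ir, resident_opcodes):
        return "continue"  # hardware present, sequencing continues
    return "halt"          # wait in HP until the module is configured


resident = {0x10, 0x21}  # hypothetical opcodes of two resident modules
print(after_operand_fetch(0x21, resident))
print(after_operand_fetch(0x30, resident))
```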

The data-path is responsible for providing the 
proper connections to the global communication net- 
work and adhering to the established communication 
protocol. Instruction modules not executing refrain 
from sending any signals on the communication chan- 
nel to prevent the corruption of other operating in- 
structions. The data-path unit provides a new value 
for the data register during the EX stage. Most in- 
structions perform their function by modifying the 
DR. 

Several custom-instruction modules of varying size 
have been implemented on DISC. These vary from a 
simple single row shifter to a complex edge-detection 
module of 34 rows. Table 1 shows the current instruc- 
tions available for DISC. The circuit layout for the 
Adder/Subtracter module is seen in Figure 7. 

Table 1: Sample Custom Instruction Modules.

Figure 7: DISC Adder/Subtracter Custom Module Layout.

6 System Operation

The DISC processor was implemented on a PC-ISA custom board
made exclusively for the study. The board includes static bus
interface circuitry, two CLAy31 FPGAs, and memory. A
configuration controller is implemented on the first FPGA to
monitor processor execution and request instructions from the
host. DISC is implemented on the second FPGA and the
application program memory is stored in the adjacent memory
(see Figure 8). The board operates under a UNIX-based
operating system and is controlled by a host device driver.

Upon receiving a request for an instruction module, the host
evaluates the current state of the DISC FPGA hardware and
chooses a physical location for the requested module. The
physical location is chosen based on available FPGA resources
and the existence of idle instruction modules. If possible,
the instruction module is loaded in an FPGA location not
currently occupied by any other instruction module. If no
empty hardware locations are available, a simple
least-recently-used (LRU) algorithm is used to remove idle
hardware. The host modifies the bit-stream of the requested
hardware module to reflect the placement changes. The
hardware module is then configured on the DISC platform by
sending the new configuration to the system. Figure 9
provides a simplified flow chart of DISC instruction
execution.
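
The host's placement decision can be sketched as an allocator over the 46 rows left free by the global controller. The Python sketch below is an assumption-laden illustration, not the actual host software: it uses first-fit to find an empty run of rows and evicts least-recently-used modules until the request fits. The class name `LinearSpace`, its methods, and the module names are hypothetical; the heights (1, 31 and 34 rows) are the shifter, MEAN and edge-detection sizes quoted in the paper.

```python
TOTAL_ROWS = 46  # rows left for custom modules after the global controller


class LinearSpace:
    """Row-based placement with LRU eviction (illustrative only)."""

    def __init__(self, rows=TOTAL_ROWS):
        self.rows = rows
        self.modules = {}    # name -> (start_row, height)
        self.last_used = {}  # name -> logical timestamp
        self.clock = 0

    def _find_gap(self, height):
        """First-fit: lowest start row with `height` free rows, or None."""
        cursor = 0
        for start, h in sorted(self.modules.values()):
            if start - cursor >= height:
                return cursor
            cursor = start + h
        return cursor if self.rows - cursor >= height else None

    def request(self, name, height):
        """Make `name` resident; return (start_row, evicted_names)."""
        if height > self.rows:
            raise ValueError("module taller than the hardware space")
        evicted = []
        if name not in self.modules:
            start = self._find_gap(height)
            while start is None:
                # Remove the least recently used module and retry.
                victim = min(self.modules, key=self.last_used.get)
                del self.modules[victim]
                del self.last_used[victim]
                evicted.append(victim)
                start = self._find_gap(height)
            self.modules[name] = (start, height)
        self.clock += 1
        self.last_used[name] = self.clock
        return self.modules[name][0], evicted


space = LinearSpace()
print(space.request("SHIFT", 1))
print(space.request("MEAN", 31))
print(space.request("EDGE", 34))
```

The sketch does not compact resident modules or rewrite bit-streams; it only models where a module would land and which modules would be displaced.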
Figure 8: DISC System.

Performance has not been a main consideration as 
DISC was implemented primarily to study dynamic 
instruction set modification through partial reconfig- 
uration. As a research tool, the processor is 8 bits 
and operates at the host bus speed of 7.5 MHz (max- 
imum operating speed calculated at 12 MHz). Pro- 
cessor widths and operating speeds can be increased 
as device densities increase and tool enhancements be- 
come available. 

A DISC application is initiated by first, loading the 
program memory with the target application, and sec- 
ond, configuring the DISC FPGA with the global con- 
troller. During execution, the processor validates the 
presence of each instruction in the hardware. If the in- 
struction requested by the application program does 
not exist on the hardware, the processor enters a halt- 
ing state and requests the instruction module from the 
host. 



Figure 9: DISC Instruction Execution. 

One drawback of partially configuring the device during
run-time is the overhead caused by continually reconfiguring
instruction modules. The current board configures the DISC
processor by sending the configuration bit-stream one bit per
bus transfer over the PC-ISA bus. Operating at a maximum
transfer rate of 1.5 Mb/sec, the PC-host is capable of
configuring one row in 600 µs. This represents 4511 processor
cycles or 1500 simple instruction executions for each row
configured. By removing the current system board and bus
limitations, configuration speeds would improve by a factor
of 64 and operate at the device maximum of 12 MB/sec.
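
These figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded values quoted above, so it yields about 4500 cycles per row rather than the quoted 4511; the bits-per-row value is derived here and is not stated in the paper. All constant names are hypothetical.

```python
CLOCK_HZ = 7.5e6                    # processor clock (host bus speed)
ROW_CONFIG_SECONDS = 600e-6         # quoted time to configure one row
BUS_BITS_PER_SECOND = 1.5e6         # quoted PC-ISA configuration rate
DEVICE_BITS_PER_SECOND = 12e6 * 8   # quoted device maximum, 12 MB/sec
CYCLES_PER_SIMPLE_INSTRUCTION = 3   # IF, OF, EX

# Processor cycles that elapse while one row is configured.
cycles_per_row = ROW_CONFIG_SECONDS * CLOCK_HZ
# Equivalent number of three-cycle instructions.
instructions_per_row = cycles_per_row / CYCLES_PER_SIMPLE_INSTRUCTION
# Derived bit-stream size of one row at the quoted bus rate.
bits_per_row = ROW_CONFIG_SECONDS * BUS_BITS_PER_SECOND
# Ratio between the device maximum and the current bus rate.
bandwidth_gain = DEVICE_BITS_PER_SECOND / BUS_BITS_PER_SECOND

print(round(cycles_per_row), round(instructions_per_row),
      round(bits_per_row), round(bandwidth_gain))
```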

Custom instruction modules should remain resident in the
processor for long periods of time to decrease the
reconfiguration overhead. In addition, custom instruction
modules should provide enough performance improvement over a
sequence of general purpose ALU instructions to justify the
cost of reconfiguration at run-time. The following
application example will demonstrate this tradeoff.

7 Application Example 

A simple image mean filter was developed as both a sequence
of general purpose instructions and as an application
specific hardware module to demonstrate the performance
improvements gained by tailoring the hardware to the
application. Both demonstrations calculate the mean value of
each pixel in an image, g(x, y), by obtaining an average over
a 3x3 neighborhood as follows:

$$ g(x, y) = \frac{1}{8} \sum_{m=-1}^{1} \sum_{n=-1}^{1} f(x+m, y+n) $$

where f denotes the input image. A coefficient of 1/8 was
used to simplify the design. The 128 x 64 grey scale image in
Figure 10 was used as the test image for both cases.

Figure 10: Original Test Image.

7.1 General Purpose Approach

The general purpose approach required four instructions not
found in the processor core: add, subtract, shift, and
enhanced addressing modes. These additional modules comprised
a total of 8 rows, leaving 38 rows free for other custom
instruction modules.

Execution of the algorithm centered on the inner loop
calculation of the 3x3 neighborhood mean value. Calculating
each pixel value involved individually adding each pixel of
the neighborhood. Many of the instructions used for this
summing operation involved address calculation and pointer
manipulations. Computation of each pixel finishes with three
shifts for the division by eight.

Complete processing of a pixel required an average of 160
instructions or 560 clock cycles. Processing the complete
image, including overhead, required 4.59 Mclocks or 610 ms
(7.5 MHz).

7.2 Application Specific Approach

The application specific approach significantly improves
performance of the algorithm by assuming control of address
generation, buffering pixel values, and pipelining the
arithmetic. With 31 rows of hardware, the extra registers,
arithmetic operators and control logic consume significantly
more hardware than the simple instructions used in the
general purpose approach.

The MEAN instruction module calculates the average of a 3x3
neighborhood through the use of a sliding window as seen in
Figure 11. Each numbered element of the sliding window
represents a pixel register in the custom module. Instead of
loading the entire window from memory at each pixel, register
values are shifted to represent a sliding window (see Figure
12). Only registers 3, 6, and 9 are loaded at each new pixel.

Figure 11: Sliding Pixel Window.

With the window registers loaded, the custom instruction
module adds all nine pixel values in parallel with eight
custom adders as seen in Figure 12. The division by eight is
achieved by shifting the results three bit positions.

Figure 12: Dataflow of MEAN Instruction Module.

The MEAN instruction requires only 7 clock cycles 
to evaluate each pixel of the image. The clock cycles 
are scheduled as follows: 

1. Load register 3 

2. Load register 6 

3. Load register 9 

4. Wait (add delay to parallel add) 

5. Write results to image memory 

6. Calculate new address 

7. Shift register window 

Reducing the pixel calculation to seven clock cycles and
eliminating much of the address calculation overhead reduces
the clock count from 4.59M in the general purpose case to 57k
for an 80 times speedup. Operating at 7.5 MHz, the image is
filtered in 7.6 ms. Figure 13 displays the image filtered
with the MEAN custom instruction.

Figure 13: Test Image Filtered Through MEAN Custom
Instruction.
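
The sliding-window scheme used by the MEAN module can be modelled in software. The following Python sketch is a behavioural illustration only, not a description of the circuit: it keeps a 3x3 window per output pixel, shifts it one column and loads only the three new right-hand pixels (the role of registers 3, 6, and 9), then sums the nine values and shifts right by three. Border handling and overflow behaviour are not described in the paper, so copying border pixels unchanged and leaving the result unclamped are assumptions, as is the function name `mean_filter`.

```python
def mean_filter(image):
    """Sliding-window 3x3 sum divided by eight (shift right by 3).

    image is a list of equal-length rows of integers, at least 3x3.
    Border pixels are copied unchanged (an assumption).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        rows = image[y - 1:y + 2]
        # Preload so that the first shift leaves columns 0..2 in place.
        window = [[0, r[0], r[1]] for r in rows]
        for x in range(1, width - 1):
            # Shift each window row left and load one new pixel per row.
            for i, r in enumerate(rows):
                window[i] = window[i][1:] + [r[x + 1]]
            total = sum(sum(w) for w in window)
            out[y][x] = total >> 3  # unclamped; may exceed 255
    return out


example = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12]]
print(mean_filter(example))
```

Because the coefficient is 1/8 rather than 1/9, a uniform region comes out slightly brighter than its input, which matches the simplification stated above.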



7.3 Configuration Overhead 

Because the cost of reconfiguring the application- 
specific instruction module is so high, configuration 
overhead must be considered when comparing the two 
approaches. The 31 row MEAN instruction requires 
an additional 140 kcycles for configuration, raising the 
total cycle count to 197 kcycles. The MEAN configu- 
ration overhead represents 71% of the total operating 
time. If device configuration speeds are maximized, 
this configuration overhead is reduced to 16% of the 
total operating time. 

The extra four modules needed for the general purpose
approach require only 36 kcycles for configuration. This
represents less than 1% of the total operating time. When the
high cost of configuration is included in the total operating
time, the MEAN filter custom instruction provides a 23 times
speedup over the general purpose approach (see Table 2).





                        General     Application
                        Purpose     Specific

Rows                    8           31
Operation Cycles        4.59M       57k
Raw Speedup             1           80
Area*Time               36.7M       1.8M
Configuration Cycles    36k         140k
Total Cycles            4.63M       197k
Actual Speedup          1           23.5



Table 2: Performance Comparison between General 
Purpose and Application Specific Approaches. 
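
The entries of Table 2 can be approximately reproduced from numbers quoted earlier in the paper. The sketch below assumes that every one of the 128 x 64 pixels costs the quoted per-pixel cycle count and that each row costs 4511 cycles to configure; the names are hypothetical, and small differences from the table come from the rounding used in the paper.

```python
CYCLES_PER_ROW = 4511   # configuration cost per row quoted in Section 6
PIXELS = 128 * 64       # test image size


def totals(rows, cycles_per_pixel):
    """Return (operation, configuration, total) cycle counts."""
    operation = PIXELS * cycles_per_pixel
    configuration = rows * CYCLES_PER_ROW
    return operation, configuration, operation + configuration


# General purpose: 8 rows, 560 cycles per pixel.
gp_op, gp_cfg, gp_total = totals(rows=8, cycles_per_pixel=560)
# Application specific (MEAN): 31 rows, 7 cycles per pixel.
as_op, as_cfg, as_total = totals(rows=31, cycles_per_pixel=7)

raw_speedup = gp_op / as_op
actual_speedup = gp_total / as_total
as_overhead = as_cfg / as_total   # share of time spent configuring
gp_overhead = gp_cfg / gp_total

print(gp_op, gp_cfg, gp_total)
print(as_op, as_cfg, as_total)
print(raw_speedup, round(actual_speedup, 2),
      round(as_overhead, 2), round(gp_overhead, 4))
```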

8 Conclusions 

The DISC processor successfully demonstrates that application
specific processors with arbitrarily large instruction sets
can be constructed on partially reconfigurable FPGAs. The
relocatable hardware model improved run-time utilization of
FPGA resources and the linear hardware model provided a
convenient framework for relocating custom instruction
modules. DISC demonstrates the general concept of alleviating
density constraints of FPGAs by partially reconfiguring a
device at run-time.



Although the techniques of partial configuration, relocatable
hardware, and the linear hardware model were implemented as a
general purpose processor, they offer similar advantages to
other digital architectures. They may enhance the usefulness
of FPGA co-processors by providing demand-driven computation.
In addition, these techniques may allow FPGA based computing
machines to operate in more dynamic environments such as
multi-tasking operating systems. Any digital architecture
that could benefit from demand-driven hardware may find these
techniques useful.